﻿WEBVTT

00:00:11.077 --> 00:00:14.258
- Okay we have a lot to cover
today so let's get started.

00:00:14.258 --> 00:00:17.454
Today we'll be talking
about Generative Models.

00:00:17.454 --> 00:00:20.484
And before we start, a few
administrative details.

00:00:20.484 --> 00:00:23.522
So midterm grades will be
released on Gradescope this week.

00:00:23.522 --> 00:00:27.730
A reminder that A3 is
due next Friday May 26th.

00:00:27.730 --> 00:00:32.709
The HyperQuest extra credit deadline is
Sunday May 21st; you can still do it until then.

00:00:33.632 --> 00:00:37.799
And our poster session is
June 6th from 12 to 3 P.M.

00:00:40.812 --> 00:00:47.759
Okay, so an overview of what we're going to talk about today: we're going to
switch gears a little bit and take a look at unsupervised learning.

00:00:47.759 --> 00:00:54.103
And in particular we're going to talk about generative
models, which are a type of unsupervised learning.

00:00:54.103 --> 00:00:57.112
And we'll look at three
types of generative models.

00:00:57.112 --> 00:01:01.174
So PixelRNNs and PixelCNNs,
variational autoencoders,

00:01:01.174 --> 00:01:04.174
and Generative Adversarial Networks.

00:01:05.571 --> 00:01:11.168
So so far in this class we've talked a lot about supervised
learning and different kinds of supervised learning problems.

00:01:11.168 --> 00:01:16.078
So in the supervised learning set up we have
our data X and then we have some labels Y.

00:01:16.078 --> 00:01:21.417
And our goal is to learn a function that's
mapping from our data X to our labels Y.

00:01:21.417 --> 00:01:26.237
And these labels can take
many different forms.

00:01:26.237 --> 00:01:34.934
So for example, we've looked at classification where our input is
an image and we want to output Y, a class label for the category.

00:01:34.934 --> 00:01:44.093
We've talked about object detection where now our input is still an image but here
we want to output bounding boxes for object instances, for example multiple dogs or cats.

00:01:46.138 --> 00:01:51.986
We've talked about semantic segmentation where here we have a
label for every pixel: the category that each pixel belongs to.

00:01:53.572 --> 00:01:58.961
And we've also talked about image captioning
where here our label is now a sentence

00:01:58.961 --> 00:02:02.961
and so it's now in the
form of natural language.

00:02:03.998 --> 00:02:15.661
So in the unsupervised learning setup, we have unlabeled training
data and our goal now is to learn some underlying hidden structure of the data.

00:02:15.661 --> 00:02:20.370
Right, so an example of this can be something like
clustering which you guys might have seen before

00:02:20.370 --> 00:02:25.029
where here the goal is to find groups within the
data that are similar through some type of metric.

00:02:25.029 --> 00:02:27.187
For example, K means clustering.

00:02:27.187 --> 00:02:32.871
Another example of an unsupervised learning
task is dimensionality reduction.

00:02:33.777 --> 00:02:38.939
So in this problem we want to find axes along which
our training data has the most variation,

00:02:38.939 --> 00:02:43.537
and so these axes are part of the
underlying structure of the data.

00:02:43.537 --> 00:02:51.095
And then we can use this to reduce the dimensionality of the data such that
it still has significant variation along each of the remaining dimensions.

00:02:51.095 --> 00:02:57.842
Right, so this example here we start off with data in three
dimensions and we're going to find two axes of variation in this case

00:02:57.842 --> 00:03:01.259
and project our data down to 2D.
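As a concrete illustration of this 3D-to-2D projection, here is a minimal PCA sketch in NumPy; the data and sizes are hypothetical, just for illustration:

```python
import numpy as np

def pca_project(X, k):
    """Project data X (n_samples x n_features) onto its top-k axes of variation."""
    Xc = X - X.mean(axis=0)                 # center the data first
    cov = Xc.T @ Xc / (len(Xc) - 1)         # covariance of the centered data
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors = axes of variation
    order = np.argsort(eigvals)[::-1]       # sort axes by variance, descending
    W = eigvecs[:, order[:k]]               # keep the top-k axes
    return Xc @ W                           # project the data down to k dims

# Toy 3D data that mostly varies along two directions, as in the slide
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(200, 3))
Z = pca_project(X, 2)
print(Z.shape)  # (200, 2)
```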

00:03:04.205 --> 00:03:09.964
Another example of unsupervised learning is
learning feature representations for data.

00:03:11.006 --> 00:03:17.209
We've seen how to do this in supervised ways before where
we used the supervised loss, for example classification.

00:03:17.209 --> 00:03:21.617
Where we have a classification label,
we have something like a Softmax loss,

00:03:21.617 --> 00:03:29.869
And we can train a neural network where we can interpret the activations, for
example our FC7 layer, as some kind of feature representation for the data.

00:03:29.869 --> 00:03:35.742
And in an unsupervised setting, for example here
autoencoders which we'll talk more about later

00:03:35.742 --> 00:03:46.872
In this case our loss is now trying to reconstruct the input data, so that
we get a good reconstruction of our input data and use this to learn features.

00:03:46.872 --> 00:03:52.245
So we're learning a feature representation
without using any additional external labels.

00:03:53.471 --> 00:03:59.585
And finally another example of unsupervised learning
is density estimation where in this case we want to

00:03:59.585 --> 00:04:02.884
estimate the underlying
distribution of our data.

00:04:02.884 --> 00:04:10.811
So for example in this top case over here, we have points
in 1-D and we can try and fit a Gaussian to this density,

00:04:10.811 --> 00:04:16.605
and in this bottom example over here it's 2D data, and
here again we're trying to estimate the density, and

00:04:16.605 --> 00:04:24.239
we can model this density; we want to fit a model such that
the density is higher where more points are concentrated.
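The 1D Gaussian fit mentioned here is just maximum-likelihood estimation of a mean and variance; a minimal sketch, with made-up data points:

```python
import math

def fit_gaussian(xs):
    """Maximum-likelihood Gaussian fit to 1D points: sample mean and variance."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n   # MLE uses 1/n, not 1/(n-1)
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x; higher where points are concentrated."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

points = [1.9, 2.1, 2.0, 1.8, 2.2]
mu, var = fit_gaussian(points)
print(mu)  # approximately 2.0
# The fitted density is higher near the data than far from it
print(gaussian_pdf(2.0, mu, var) > gaussian_pdf(5.0, mu, var))  # True
```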

00:04:26.100 --> 00:04:35.990
And so to summarize the differences: in supervised learning, which we've looked at
a lot so far, we want to use labeled data to learn a function mapping from X to Y,

00:04:35.990 --> 00:04:44.124
and in unsupervised learning we use no labels and instead we try to learn
some underlying hidden structure of the data, whether this is grouping,

00:04:44.124 --> 00:04:48.291
axes of variation, or
underlying density estimation.

00:04:49.662 --> 00:04:54.113
And unsupervised learning is a huge
and really exciting area of research,

00:04:54.113 --> 00:05:04.339
and some of the reasons are that training data is really cheap; it doesn't use labels,
so we're able to learn from a lot of data at one time and basically utilize a lot

00:05:04.339 --> 00:05:09.977
more data than if we required annotating
or finding labels for data.

00:05:09.977 --> 00:05:17.823
And unsupervised learning is still a relatively unsolved research
area by comparison; there are a lot of open problems here,

00:05:17.823 --> 00:05:24.669
but it also holds the potential that if you're able to
successfully learn and represent a lot of the underlying structure

00:05:24.669 --> 00:05:32.729
in the data then this also takes you a long way towards the Holy
Grail of trying to understand the structure of the visual world.

00:05:35.026 --> 00:05:40.432
So that's a little bit of kind of a high-level
big picture view of unsupervised learning.

00:05:40.432 --> 00:05:44.155
And today we'll focus more
specifically on generative models,

00:05:44.155 --> 00:05:52.933
which is a class of models for unsupervised learning where given training
data our goal is to try and generate new samples from the same distribution.

00:05:52.933 --> 00:05:57.686
Right, so we have training data over here
generated from some distribution P data

00:05:57.686 --> 00:06:04.955
and we want to learn a model, P model to
generate samples from the same distribution

00:06:04.955 --> 00:06:09.854
and so we want to learn P
model to be similar to P data.

00:06:09.854 --> 00:06:12.636
And generative models
address density estimation.

00:06:12.636 --> 00:06:22.180
So this problem that we saw earlier of trying to estimate the underlying
distribution of your training data which is a core problem in unsupervised learning.

00:06:22.180 --> 00:06:25.190
And we'll see that there's
several flavors of this.

00:06:25.190 --> 00:06:33.353
We can use generative models to do explicit density estimation
where we're going to explicitly define and solve for our P model

00:06:35.045 --> 00:06:37.610
or we can also do implicit
density estimation

00:06:37.610 --> 00:06:45.035
where in this case we'll learn a model that can produce
samples from P model without explicitly defining it.

00:06:47.700 --> 00:06:54.096
So, why do we care about generative models? Why is this a
really interesting core problem in unsupervised learning?

00:06:54.096 --> 00:06:57.451
Well there's a lot of things that
we can do with generative models.

00:06:57.451 --> 00:07:04.659
If we're able to create realistic samples from the data distributions
that we want we can do really cool things with this, right?

00:07:04.659 --> 00:07:14.568
We can generate just beautiful samples to start with. On the left you can
see completely new samples generated by these generative models.

00:07:14.568 --> 00:07:21.042
Also in the center here are generated samples of
images. We can also do tasks like super resolution,

00:07:21.042 --> 00:07:32.145
and colorization, hallucinating or filling in these edges with
generated ideas of the colors and what the purse should look like.

00:07:32.145 --> 00:07:41.619
We can also use generative models of time series data for simulation and
planning, and so this will be useful for reinforcement learning applications,

00:07:41.619 --> 00:07:45.089
which we'll talk a bit more about
in a later lecture.

00:07:45.089 --> 00:07:50.261
And training generative models can also
enable inference of latent representations.

00:07:50.261 --> 00:07:57.435
Learning latent features that can be useful
as general features for downstream tasks.

00:07:59.059 --> 00:08:05.688
So if we look at types of generative models
these can be organized into the taxonomy here

00:08:05.688 --> 00:08:13.180
where we have these two major branches that we talked
about, explicit density models and implicit density models.

00:08:13.180 --> 00:08:19.062
And then we can also get down into many
of these other sub categories.

00:08:19.062 --> 00:08:27.814
And we can note that this figure is adapted
from a tutorial on GANs by Ian Goodfellow,

00:08:27.814 --> 00:08:36.861
and so if you're interested in these different taxonomies and categorizations
of generative models, this is a good resource that you can take a look at.

00:08:36.861 --> 00:08:45.645
But today we're going to discuss three of the most popular types
of generative models that are in use and in research today.

00:08:45.645 --> 00:08:49.475
And so we'll talk first briefly
about pixelRNNs and CNNs

00:08:49.475 --> 00:08:52.162
And then we'll talk about
variational autoencoders.

00:08:52.162 --> 00:08:55.661
These are both types of
explicit density models.

00:08:55.661 --> 00:08:57.494
One that's using a tractable density

00:08:57.494 --> 00:09:01.312
and another that's using
an approximate density

00:09:01.312 --> 00:09:05.614
And then we'll talk about
generative adversarial networks,

00:09:05.614 --> 00:09:09.781
GANs which are a type of
implicit density estimation.

00:09:12.152 --> 00:09:16.304
So let's first talk
about pixelRNNs and CNNs.

00:09:16.304 --> 00:09:20.015
So these are a type of fully
visible belief networks

00:09:20.015 --> 00:09:22.432
which are modeling a density explicitly

00:09:22.432 --> 00:09:34.941
So in this case we have this image data X and we want to model the
probability or likelihood of this image, P of X. And so for these kinds of models,

00:09:34.941 --> 00:09:40.384
we use the chain rule to decompose this likelihood
into a product of one-dimensional distributions.

00:09:40.384 --> 00:09:43.493
So we have here the
probability of each pixel X I

00:09:43.493 --> 00:09:47.871
conditioned on all previous
pixels X1 through XI - 1.

00:09:47.871 --> 00:09:58.073
And your joint likelihood of all the pixels in your image is
going to be the product of all of these one-dimensional likelihoods together.

00:09:58.073 --> 00:10:08.938
And then once we define this likelihood, in order to train this model we can
just maximize the likelihood of our training data under this defined density.
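In code, this chain-rule decomposition means the joint log-likelihood of an image is just a sum of per-pixel conditional log-probabilities. A sketch with a toy stand-in model; a real PixelRNN/PixelCNN would condition on the previous pixels rather than ignore them:

```python
import math

def joint_log_likelihood(pixels, conditional_prob):
    """log p(x) = sum_i log p(x_i | x_1 .. x_{i-1}), by the chain rule."""
    total = 0.0
    for i, x_i in enumerate(pixels):
        total += math.log(conditional_prob(x_i, pixels[:i]))
    return total

# Toy stand-in "model": uniform over the 256 pixel values, ignoring the context
# (hypothetical; just to show how the per-pixel terms combine).
uniform = lambda x_i, context: 1.0 / 256

ll = joint_log_likelihood([12, 200, 37], uniform)
print(ll)  # 3 * log(1/256)
```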

00:10:10.980 --> 00:10:20.833
So if we look at this distribution over pixel values, right, we have this P of
XI given all the previous pixel values; well, this is a really complex distribution.

00:10:20.833 --> 00:10:22.700
So how can we model this?

00:10:22.700 --> 00:10:29.042
Well we've seen before that if we want to have complex
transformations we can do these using neural networks.

00:10:29.042 --> 00:10:32.828
Neural networks are a good way to
express complex transformations.

00:10:32.828 --> 00:10:42.300
And so what we'll do is we'll use a neural
network to express this complex distribution.

00:10:43.235 --> 00:10:44.796
And one thing you'll see here is that,

00:10:44.796 --> 00:10:51.212
okay, even if we're going to use a neural network for this, another
thing we have to take care of is how we order the pixels.

00:10:51.212 --> 00:10:58.886
Right, I said here that we have a distribution for P of XI given
all previous pixels, but what does "all previous pixels" mean?

00:10:58.886 --> 00:11:01.303
So we'll take a look at that.

00:11:03.336 --> 00:11:06.669
So PixelRNN was a model proposed in 2016

00:11:07.595 --> 00:11:17.657
that basically defines a way for setting up and
optimizing this problem and so how this model works is

00:11:17.657 --> 00:11:21.187
that we're going to generate pixels
starting in a corner of the image.

00:11:21.187 --> 00:11:31.050
So we can look at this grid as basically the pixels of your image and so
what we're going to do is start from the pixel in the upper left-hand corner

00:11:31.050 --> 00:11:37.195
and then we're going to sequentially generate pixels based
on these connections from the arrows that you can see here.

00:11:37.195 --> 00:11:44.332
And each of the dependencies on the previous pixels
in this ordering is going to be modeled using an RNN

00:11:44.332 --> 00:11:48.092
or more specifically an LSTM which
we've seen before in lecture.

00:11:48.092 --> 00:11:55.242
Right so using this we can basically continue to
move forward, just moving down along this diagonal,

00:11:55.242 --> 00:12:01.244
and generating all of these pixel values dependent
on the pixels that they're connected to.

00:12:01.244 --> 00:12:08.736
And so this works really well, but the drawback here is this
sequential generation, right, it's actually quite slow.

00:12:08.736 --> 00:12:15.061
You can imagine, if you're going to generate a new image, instead
of the feed-forward passes that we've seen with CNNs.

00:12:15.061 --> 00:12:20.952
Here we're going to have to iteratively go through
and generate all of these pixels one by one.

00:12:24.044 --> 00:12:30.575
So a little bit later, after a pixelRNN,
another model called pixelCNN was introduced.

00:12:30.575 --> 00:12:34.570
And this has a very
similar setup to PixelRNN,

00:12:34.570 --> 00:12:43.074
and we're still going to do this image generation starting from the corner of the
image and expanding outwards, but the difference is that now instead of using

00:12:43.074 --> 00:12:47.752
an RNN to model all these dependencies,
we're going to use a CNN instead.

00:12:47.752 --> 00:12:52.179
And we're now going to use a
CNN over a context region

00:12:52.179 --> 00:12:56.384
that you can see here around the particular
pixel that we're going to generate.

00:12:56.384 --> 00:13:09.313
Right so we take the pixels around it, this gray area within the region that's already been
generated and then we can pass this through a CNN and use that to generate our next pixel value.

00:13:11.041 --> 00:13:18.055
And so what this is going to give us is
a CNN, a neural network, at each pixel location,

00:13:18.055 --> 00:13:22.967
right, and so the output of this is going to be
a softmax over the pixel values here.

00:13:22.967 --> 00:13:31.193
In this case we have values 0 to 255, and then we can train
this by maximizing the likelihood of the training images.

00:13:31.193 --> 00:13:43.482
Right, so basically we take a training image, we're going to
do this generation process, and at each pixel location we have the ground truth

00:13:43.482 --> 00:13:53.976
training image value, and this is basically the label, the
classification label, that we want our pixel to take: which of these 256 values,

00:13:53.976 --> 00:13:56.723
and we can train this
using a Softmax loss.

00:13:56.723 --> 00:14:05.597
Right and so basically the effect of doing this is that we're going to
maximize the likelihood of our training data pixels being generated.
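The per-pixel objective described here, a 256-way softmax with the ground-truth pixel value acting as the classification label, can be sketched like this; the logits are hypothetical stand-ins for a CNN's output at one pixel:

```python
import math

def softmax_nll(logits, target):
    """Cross-entropy for one pixel: negative log-probability of the
    ground-truth value under a softmax over the 256 possible values."""
    m = max(logits)                               # subtract max for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]                 # -log softmax(logits)[target]

# 256 logits, imagined as the output of a CNN over the context region;
# the "label" is the pixel's actual value in the training image.
logits = [0.0] * 256
logits[37] = 5.0            # model already favors the true value 37
loss_good = softmax_nll(logits, target=37)
loss_bad = softmax_nll(logits, target=200)
print(loss_good < loss_bad)  # True: likely values get lower loss
```

Minimizing this loss at every pixel is the same as maximizing the likelihood of the training pixels being generated.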

00:14:05.597 --> 00:14:08.413
Okay any questions about this?
Yes.

00:14:08.413 --> 00:14:12.159
[student's words obscured
due to lack of microphone]

00:14:12.159 --> 00:14:18.675
Yeah, so the question is, I thought we were talking about unsupervised
learning, why do we have basically a classification label here?

00:14:18.675 --> 00:14:24.970
The reason is that this loss, this output that
we have is the value of the input training data.

00:14:24.970 --> 00:14:26.983
So we have no external labels, right?

00:14:26.983 --> 00:14:38.533
We didn't go and have to manually collect any labels for this; we're just taking
our input data and saying that this is what we use for the loss function.

00:14:41.199 --> 00:14:45.366
[student's words obscured
due to lack of microphone]

00:14:47.998 --> 00:14:50.746
The question is, is
this like bag of words?

00:14:50.746 --> 00:14:53.109
I would say it's not really bag of words,

00:14:53.109 --> 00:15:01.466
it's more that we're outputting a distribution over
pixel values at each location of our image, and what we want to do

00:15:01.466 --> 00:15:10.442
is we want to maximize the likelihood of our input,
our training data being produced, being generated.

00:15:10.442 --> 00:15:15.761
Right so, in that sense, this is why it's
using our input data to create our loss.

00:15:21.006 --> 00:15:24.904
So with PixelCNN, training
is faster than with PixelRNN,

00:15:24.904 --> 00:15:34.301
because here, at every pixel location, we want to
maximize the likelihood of our training data

00:15:34.301 --> 00:15:40.739
showing up, and we already have all of these pixel values
from our training data, so we can do this much

00:15:40.739 --> 00:15:47.296
faster. But at generation time, at test time, we want to
generate a completely new image, starting from

00:15:47.296 --> 00:15:59.197
the corner, and we're not doing any learning there, so at generation time
we still have to generate each of these pixel locations before we can generate the next one.

00:15:59.197 --> 00:16:03.025
And so generation time here is still slow
even though training time is faster.

00:16:03.025 --> 00:16:04.204
Question.

00:16:04.204 --> 00:16:08.365
[student's words obscured
due to lack of microphone]

00:16:08.365 --> 00:16:14.077
So the question is: is this trained
distribution sensitive to what you pick for the first pixel?

00:16:14.077 --> 00:16:21.208
Yeah, so it is dependent on what you have as the initial pixel
distribution and then everything is conditioned based on that.

00:16:23.203 --> 00:16:32.171
So again, how do you pick this distribution? So at training time you have
these distributions from your training data and then at generation time

00:16:32.171 --> 00:16:38.368
you can just initialize this with either uniform
or from your training data, however you want.

00:16:38.368 --> 00:16:42.553
And then once you have that everything
else is conditioned based on that.

00:16:42.553 --> 00:16:43.912
Question.

00:16:43.912 --> 00:16:48.079
[student's words obscured
due to lack of microphone]

00:17:07.415 --> 00:17:14.146
Yeah, so the question is: is there a reason we define this in this
chain rule fashion instead of predicting all the pixels at one time?

00:17:14.146 --> 00:17:17.884
And so we'll see models
later that do do this,

00:17:17.884 --> 00:17:27.868
but what the chain rule allows us to do is find this very tractable
density that we can then directly optimize, maximizing the likelihood.

00:17:31.864 --> 00:17:39.606
Okay so these are some examples of generations from
this model and so here on the left you can see

00:17:39.606 --> 00:17:48.846
generations where the training data is the CIFAR-10 dataset. And so you can
see that in general they are starting to capture statistics of natural images.

00:17:48.846 --> 00:17:56.848
You can see general types of blobs and kind of things
that look like parts of natural images coming out.

00:17:56.848 --> 00:18:02.768
On the right here it's ImageNet, we can again see samples
from here and these are starting to look like natural images

00:18:05.060 --> 00:18:09.966
but they're still not, there's
still room for improvement.

00:18:09.966 --> 00:18:17.059
You can still see that there are obviously differences from real
training images, and some of the semantics are not clear in here.

00:18:19.371 --> 00:18:27.020
So, to summarize this, pixelRNNs and CNNs allow
you to explicitly compute likelihood P of X.

00:18:27.020 --> 00:18:29.297
It's an explicit density
that we can optimize.

00:18:29.297 --> 00:18:34.043
And being able to do this also has another
benefit of giving a good evaluation metric.

00:18:34.043 --> 00:18:40.958
You know you can kind of measure how good your samples
are by this likelihood of the data that you can compute.

00:18:40.958 --> 00:18:47.043
And it's able to produce pretty good samples
but it's still an active area of research

00:18:47.043 --> 00:18:53.760
and the main disadvantage of these methods is that the
generation is sequential and so it can be pretty slow.

00:18:53.760 --> 00:18:59.324
And these kinds of methods have also been
used for generating audio for example.

00:18:59.324 --> 00:19:08.170
And you can look online for some pretty interesting examples of this, but
again the drawback is that it takes a long time to generate these samples.

00:19:08.170 --> 00:19:14.565
And there has been a lot of work since
then on improving PixelCNN performance,

00:19:14.565 --> 00:19:22.346
with all kinds of different architecture changes, loss
function reformulations, and different types of training tricks.

00:19:22.346 --> 00:19:29.495
And so if you're interested in learning more about
this you can look at some of these papers on PixelCNN

00:19:29.495 --> 00:19:35.115
and then PixelCNN++, an
improved version that came out this year.

00:19:37.455 --> 00:19:44.321
Okay so now we're going to talk about another type
of generative model called variational autoencoders.

00:19:44.321 --> 00:19:52.204
And so far we saw that PixelCNNs defined a tractable
density function, right, using this definition,

00:19:52.204 --> 00:19:58.365
and based on that we can directly
optimize the likelihood of the training data.

00:19:59.419 --> 00:20:04.195
So with variational autoencoders now we're going
to define an intractable density function.

00:20:04.195 --> 00:20:10.769
We're now going to model this with an additional latent
variable Z and we'll talk in more detail about how this looks.

00:20:10.769 --> 00:20:17.886
And so our data likelihood P of X now
basically has to be this integral, right,

00:20:17.886 --> 00:20:21.422
taking the expectation over
all possible values of Z.

00:20:21.422 --> 00:20:26.909
And so this now is going to be a problem. We'll
see that we cannot optimize this directly.

00:20:26.909 --> 00:20:33.706
And so instead what we have to do is we have to derive
and optimize a lower bound on the likelihood instead.
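Written out in standard VAE notation (assuming a prior p(z) and a decoder p_θ(x|z); the q_φ term is the approximate posterior that makes the bound tractable), the intractable likelihood and the lower bound mentioned here are:

```latex
p_\theta(x) = \int p(z)\, p_\theta(x \mid z)\, dz,
\qquad
\log p_\theta(x) \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big).
```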

00:20:33.706 --> 00:20:34.956
Yeah, question.

00:20:35.864 --> 00:20:37.592
So the question is: what is Z?

00:20:37.592 --> 00:20:42.862
Z is a latent variable and I'll go
through this in much more detail.

00:20:44.479 --> 00:20:48.538
So let's talk about some background first.

00:20:48.538 --> 00:20:54.733
Variational autoencoders are related to a type of
unsupervised learning model called autoencoders.

00:20:54.733 --> 00:21:00.965
And so we'll talk a little bit more first about autoencoders
and what they are, and then I'll explain how variational

00:21:00.965 --> 00:21:05.851
autoencoders are related and build off
of this and allow you to generate data.

00:21:05.851 --> 00:21:09.168
So with autoencoders we don't
use this to generate data,

00:21:09.168 --> 00:21:15.719
but it's an unsupervised approach for learning a lower
dimensional feature representation from unlabeled training data.

00:21:15.719 --> 00:21:21.550
All right so in this case we have our input data X and then
we're going to want to learn some features that we call Z.

00:21:22.541 --> 00:21:29.605
And then we'll have an encoder that's going to be a mapping,
a function mapping from this input data to our feature Z.

00:21:30.911 --> 00:21:33.905
And this encoder can take
many different forms right,

00:21:33.905 --> 00:21:41.239
we generally use neural networks. These models,
autoencoders, have been around for a long time:

00:21:41.239 --> 00:21:45.803
So in the 2000s we used linear
layers with non-linearities,

00:21:45.803 --> 00:21:54.389
then later on we had fully connected deeper networks and then
after that we moved on to using CNNs for these encoders.

00:21:55.385 --> 00:22:01.351
So we take our input data X and
then we map this to some feature Z.

00:22:01.351 --> 00:22:11.817
And we usually specify Z to be smaller than
X, so we're basically performing dimensionality reduction.

00:22:11.817 --> 00:22:17.729
So a question: who has an idea of why we
want to do dimensionality reduction here?

00:22:17.729 --> 00:22:20.896
Why do we want Z to be smaller than X?

00:22:22.114 --> 00:22:25.497
Yeah. [student's words obscured
due to lack of microphone]

00:22:25.497 --> 00:22:31.657
So the answer I heard is Z should represent the
most important features in X and that's correct.

00:22:32.634 --> 00:22:41.758
So we want Z to be able to learn features that can capture meaningful
factors of variation in the data. Right this makes them good features.

00:22:42.833 --> 00:22:46.717
So how can we learn this
feature representation?

00:22:46.717 --> 00:22:55.944
Well the way autoencoders do this is that we train the model such
that the features can be used to reconstruct our original data.

00:22:55.944 --> 00:23:03.730
So what we want is to take input data and use
an encoder to map it to some lower dimensional features Z.

00:23:05.320 --> 00:23:06.926
This is the output of the encoder network,

00:23:06.926 --> 00:23:16.554
and we want to be able to take these features that were produced from this input
data, and then use a decoder, a second network, to be able to output now something

00:23:16.554 --> 00:23:24.865
of the same size and dimensionality as X and have it be similar to X;
so we want to be able to reconstruct the original data.

00:23:26.387 --> 00:23:38.583
And again for the decoder we basically use the same types of networks as encoders, so
it's usually a little bit symmetric, and now we can use CNNs for most of these.

00:23:41.675 --> 00:23:48.720
Okay so the process is going to be: we're going to take our
input data and pass it through our encoder first,

00:23:48.720 --> 00:23:53.996
which is going to be something like, for example, a four-layer
convolutional network, and then we're going to

00:23:53.996 --> 00:24:04.196
get these features and then pass them through a decoder, for example a four-layer
upconvolutional network, and then get reconstructed data out at the end of this.

00:24:04.196 --> 00:24:14.409
Right, and the reason why we have a convolutional network for the encoder and an
upconvolutional network for the decoder is that at the encoder we're basically

00:24:14.409 --> 00:24:25.893
taking it from this high dimensional input to these lower dimensional features and now we want to go the
other way go from our low dimensional features back out to our high dimensional reconstructed input.

00:24:28.906 --> 00:24:39.071
And so in order to get this effect that we said we wanted before of being able
to reconstruct our input data we'll use something like an L2 loss function.

00:24:39.071 --> 00:24:49.306
Right, that basically just says: make the pixels of my
reconstructed data be the same as the pixels of my input data.
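A minimal numerical sketch of this setup: a toy linear encoder/decoder trained with the L2 reconstruction loss. This is only an illustration; the lecture's actual models use convolutional and upconvolutional networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear autoencoder: encoder maps x (4-d) -> z (2-d), decoder maps back.
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))
X = rng.normal(size=(64, 4))               # unlabeled "training data"

def reconstruction_loss(X, W_enc, W_dec):
    """L2 loss: make the reconstructed data match the input data."""
    X_hat = (X @ W_enc) @ W_dec            # encode to z, then decode back
    return ((X_hat - X) ** 2).mean()

init_loss = reconstruction_loss(X, W_enc, W_dec)
lr = 0.05
for _ in range(500):
    Z = X @ W_enc
    diff = Z @ W_dec - X
    # Gradient steps on the L2 reconstruction loss; note there are no
    # external labels anywhere -- the input itself is the training signal.
    g_dec = Z.T @ diff * (2 / diff.size)
    g_enc = X.T @ (diff @ W_dec.T) * (2 / diff.size)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
print(final_loss < init_loss)  # True: reconstruction improves with training
```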

00:24:51.078 --> 00:24:58.599
An important thing to notice here, this relates back to a question that
we had earlier, is that even though we have this loss function here,

00:24:58.599 --> 00:25:02.515
there are no external labels
that are being used in training this.

00:25:02.515 --> 00:25:10.861
All we have is our training data that we're going to use both to
pass through the network as well as to compute our loss function.

00:25:13.346 --> 00:25:19.021
So after training this model,
what we can do is throw away the decoder.

00:25:19.021 --> 00:25:26.108
All this was used for was to be able to produce our
reconstructed input and be able to compute our loss function.

00:25:26.108 --> 00:25:34.819
And we can use the encoder that we have which produces our feature
mapping and we can use this to initialize a supervised model.

00:25:34.819 --> 00:25:45.773
Right and so for example we can now go from this input to our features and then
have an additional classifier network on top of this that now we can use to output

00:25:45.773 --> 00:25:55.601
a class label, for example for a classification problem, where we have external
labels and use our standard loss functions like Softmax.

00:25:55.601 --> 00:26:04.449
And so the value of this is that we basically were able to use a lot of
unlabeled training data to try and learn good general feature representations.

00:26:04.449 --> 00:26:12.363
Right, and now we can use this to initialize a supervised learning problem
where sometimes we don't have so much data, only a small labeled dataset.

00:26:12.363 --> 00:26:19.697
And we've seen in previous homeworks and classes that
with small data it's hard to learn a model, right?

00:26:19.697 --> 00:26:22.563
You can have overfitting
and all kinds of problems,

00:26:22.563 --> 00:26:27.540
and so this allows you to initialize
your model first with better features.

00:26:31.371 --> 00:26:42.329
Okay, so we saw that autoencoders are able to reconstruct data and, as a
result, learn features that we can use to initialize a supervised model.

00:26:42.329 --> 00:26:50.133
And we saw that these features that we learned have this intuition
of being able to capture factors of variation in the training data.

00:26:50.133 --> 00:26:58.941
All right, so based on this intuition, we can have this
latent vector Z which has factors of variation in our training data.

00:26:58.941 --> 00:27:04.957
Now a natural question is well can we use a
similar type of setup to generate new images?

00:27:06.922 --> 00:27:09.502
And so now we will talk about
variational autoencoders

00:27:09.502 --> 00:27:15.987
which is a probabilistic spin on autoencoders that will let
us sample from the model in order to generate new data.

00:27:15.987 --> 00:27:19.404
Okay any questions on autoencoders first?

00:27:20.796 --> 00:27:22.828
Okay, so variational autoencoders.

00:27:22.828 --> 00:27:28.914
All right, so here we assume that our
training data, x_i for i from one to N,

00:27:30.255 --> 00:27:34.812
is generated from some underlying,
unobserved latent representation Z.

00:27:34.812 --> 00:27:38.357
Right, so it's this intuition
that Z is some vector

00:27:38.357 --> 00:27:47.069
where each element of Z captures how little or how much
of some factor of variation we have in our training data.

00:27:48.491 --> 00:27:54.811
Right so the intuition is, you know, maybe these could be something like
different kinds of attributes. Let's say we're trying to generate faces,

00:27:54.811 --> 00:28:02.608
it could be how much of a smile is on the face, it could be
the position of the eyebrows, the hair, the orientation of the head.

00:28:02.608 --> 00:28:08.772
These are all possible types of
latent factors that could be learned.

00:28:08.772 --> 00:28:13.901
Right, and so our generation process is that
we're going to sample from a prior over Z.

00:28:13.901 --> 00:28:25.014
Right so for each of these attributes for example, you know, how much smile that there is, we
can have a prior over what sort of distribution we think that there should be for this so,

00:28:25.014 --> 00:28:31.571
a gaussian is something that's a natural prior
that we can use for each of these factors of Z

00:28:31.571 --> 00:28:40.140
and then we're going to generate our data X by sampling from
a conditional, conditional distribution P of X given Z.

00:28:40.140 --> 00:28:48.862
So we sample Z first, we sample a value for each of these latent
factors and then we'll use that and sample our image X from here.
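As a sketch of this two-step generation process (sample z from the prior, then sample x from the conditional p(x|z)), assuming a toy, untrained decoder just to show the mechanics; the dimensions and fixed noise level are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, image_dim = 4, 16   # hypothetical sizes

# Untrained "decoder" weights, only to illustrate the sampling process.
W = rng.normal(size=(latent_dim, image_dim))
b = np.zeros(image_dim)

# Step 1: sample latent factors from the prior p(z) = N(0, I).
z = rng.standard_normal(latent_dim)

# Step 2: the decoder maps z to the parameters of p(x|z) = N(mu, sigma^2 I).
mu = np.tanh(z @ W + b)
sigma = 0.1                      # fixed observation noise for this sketch

# Step 3: sample an "image" x from the conditional p(x|z).
x = mu + sigma * rng.standard_normal(image_dim)
```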

00:28:51.409 --> 00:28:57.667
And so the true parameters of this generation
process are theta star, right?

00:28:57.667 --> 00:29:03.158
So we have the parameters of our prior
and our conditional distributions

00:29:03.158 --> 00:29:11.727
and what we want to do, in order to have a generative model able to
generate new data, is estimate these true parameters.

00:29:14.790 --> 00:29:17.611
Okay so let's first talk about how
should we represent this model.

00:29:20.282 --> 00:29:27.317
All right, so if we're going to have a model for this generative process, well we've
already said before that we can choose our prior P of Z to be something simple.

00:29:27.317 --> 00:29:32.713
Something like a Gaussian, right? And this is a
reasonable thing to choose for latent attributes.

00:29:35.696 --> 00:29:40.840
Now for our conditional distribution P of X
given Z this is much more complex right,

00:29:40.840 --> 00:29:43.410
because we need to use
this to generate an image

00:29:43.410 --> 00:29:53.062
and so for P of X given Z, well as we saw before, when we have some type of complex
function that we want to represent we can represent this with a neural network.

00:29:53.062 --> 00:29:58.259
And so that's a natural choice: let's try and
model P of X given Z with a neural network.

00:30:00.308 --> 00:30:02.345
And we're going to call
this the decoder network.

00:30:02.345 --> 00:30:10.167
Right, so we're going to think about taking some latent representation
and trying to decode this into the image that it's specifying.

00:30:10.167 --> 00:30:13.765
So now how can we train this model?

00:30:13.765 --> 00:30:19.419
Right, we want to be able to train this model so
that we can learn an estimate of these parameters.

00:30:19.419 --> 00:30:26.668
So if we remember our strategy for training generative models, back
from our fully visible belief networks, our pixelRNNs and CNNs,

00:30:28.577 --> 00:30:35.498
a straightforward natural strategy is to try and learn these model
parameters in order to maximize the likelihood of the training data.

00:30:35.498 --> 00:30:39.346
Right, so we saw earlier that in this case,
with our latent variable Z, we're going to have

00:30:39.346 --> 00:30:49.884
to write out P of X taking expectation over all possible values of Z which is continuous
and so we get this expression here. Right so now we have it with this latent Z

00:30:49.884 --> 00:30:55.759
and now if we want to try and
maximize this likelihood, well what's the problem?

00:30:55.759 --> 00:31:01.372
Can we just take gradients
and maximize this likelihood?

00:31:01.372 --> 00:31:04.358
[student's words obscured
due to lack of microphone]

00:31:04.358 --> 00:31:08.524
Right, so this integral is not going
to be tractable, that's correct.

00:31:10.199 --> 00:31:12.547
So let's take a look at this
in a little bit more detail.

00:31:12.547 --> 00:31:18.772
Right, so we have our data likelihood
term here. And the first term is P of Z.

00:31:18.772 --> 00:31:24.847
And here we already said earlier, we can just choose
this to be a simple Gaussian prior, so this is fine.

00:31:24.847 --> 00:31:29.031
P of X given Z, well we said we were going
to specify a decoder neural network.

00:31:29.031 --> 00:31:32.774
So given any Z, we can get
P of X given Z from here.

00:31:32.774 --> 00:31:35.721
It's the output of our neural network.

00:31:35.721 --> 00:31:38.147
But then what's the problem here?

00:31:38.147 --> 00:31:48.435
Okay this was supposed to be a different unhappy face but somehow I don't know
what happened, in the process of translation, it turned into a crying black ghost

00:31:49.298 --> 00:31:58.591
but what this is symbolizing is that basically if we want to
compute P of X given Z for every Z this is now intractable right,

00:31:59.519 --> 00:32:02.186
we cannot compute this integral.

00:32:04.794 --> 00:32:06.591
So data likelihood is intractable

00:32:06.591 --> 00:32:19.639
and it turns out that if we look at other terms in this model, if we look at our posterior
density, the posterior P of Z given X, then this is going to be P of X given Z

00:32:19.639 --> 00:32:23.712
times P of Z over P of X by Bayes' rule

00:32:23.712 --> 00:32:25.740
and this is also going
to be intractable, right.

00:32:25.740 --> 00:32:35.143
We have P of X given Z is okay, P of Z is okay, but we have this P
of X our likelihood which has the integral and it's intractable.

00:32:36.027 --> 00:32:37.993
So we can't directly optimize this.

00:32:37.993 --> 00:32:45.230
But we'll see that a solution
that will enable us to learn this model

00:32:45.230 --> 00:32:54.824
is if, in addition to using a decoder network, this neural network
to model P of X given Z, we now define an additional encoder network

00:32:54.824 --> 00:33:06.652
Q of Z given X. We're going to call this an encoder because we want to take our
input X and get the likelihood of Z given X; we're going to encode X into Z.

00:33:06.652 --> 00:33:10.329
And we define this network to approximate
P of Z given X.

00:33:12.388 --> 00:33:15.688
Right, this is the posterior density
term which is also intractable.

00:33:15.688 --> 00:33:22.866
If we use this additional network to approximate this
then we'll see that this will actually allow us to derive

00:33:22.866 --> 00:33:27.486
a lower bound on the data likelihood that
is tractable and which we can optimize.

00:33:29.308 --> 00:33:35.396
Okay so first just to be a little bit more concrete about
these encoder and decoder networks that I specified,

00:33:36.579 --> 00:33:40.695
in variational autoencoders we want to
model probabilistic generation of data.

00:33:40.695 --> 00:33:51.530
So in autoencoders we already talked about this concept of having an encoder going from
input X to some feature Z and a decoder network going from Z back out to some image X.

00:33:53.294 --> 00:33:58.907
And so here we're going to again have an encoder network and a
decoder network but we're going to make these probabilistic.

00:33:58.907 --> 00:34:06.134
So now our encoder network Q of Z given X, with
parameters phi, is going to output a mean

00:34:06.134 --> 00:34:09.467
and a diagonal covariance and from here,

00:34:11.411 --> 00:34:14.795
this will be the direct outputs of our
encoder network and the same thing for our

00:34:14.795 --> 00:34:23.109
decoder network which is going to start from Z and now it's
going to output the mean and the diagonal covariance of some X,

00:34:23.109 --> 00:34:26.725
same dimension as the input given Z

00:34:26.725 --> 00:34:29.478
And then this decoder network
has different parameters theta.

00:34:31.136 --> 00:34:42.058
And now in order to actually get our Z given X and our X
given Z, we'll sample from these distributions.

00:34:42.058 --> 00:34:49.072
So now our encoder and our decoder network are
producing distributions over Z and X respectively

00:34:49.072 --> 00:34:52.409
and we'll sample from these distributions
in order to get values from them.

00:34:52.409 --> 00:34:59.630
So you can see how this is taking us on the direction
towards being able to sample and generate new data.

00:34:59.630 --> 00:35:05.041
And just one thing to note is that these encoder and decoder
networks, you'll also hear different terms for them.

00:35:05.041 --> 00:35:09.138
The encoder network can also be called a
recognition or inference network because

00:35:09.138 --> 00:35:15.913
we're trying to perform inference of this latent
representation Z given X, and then for the decoder

00:35:15.913 --> 00:35:18.826
network, this is what we'll
use to perform generation.

00:35:18.826 --> 00:35:22.993
Right so you also hear
generation network being used.

00:35:24.410 --> 00:35:31.899
Okay so now equipped with our encoder and decoder networks,
let's try and work out the data likelihood again.

00:35:31.899 --> 00:35:35.117
And we'll use the log of
the data likelihood here.

00:35:35.117 --> 00:35:38.833
So we'll see that if we
want the log of P of X right

00:35:38.833 --> 00:35:44.988
we can write this out as log of P of X, but
take the expectation with respect to Z.

00:35:44.988 --> 00:35:51.053
So Z samples from our distribution of Q of Z given X
that we've now defined using the encoder network.

00:35:52.606 --> 00:35:58.254
And we can do this because P of X doesn't depend
on Z. Right 'cause Z is not part of that.

00:35:58.254 --> 00:36:04.794
And so we'll see that taking the expectation with
respect to Z is going to come in handy later on.

00:36:06.255 --> 00:36:20.564
Okay so now from this original expression we can now expand it out to be log of P of X given Z,
P of Z over P of Z given X using Bayes' rule. And so this is just directly writing this out.

00:36:20.564 --> 00:36:24.996
And then taking this we can also
now multiply it by a constant.

00:36:24.996 --> 00:36:30.874
Right, so Q of Z given X over Q of Z
given X. This is one, so we can do this.

00:36:30.874 --> 00:36:33.847
It doesn't change it but it's
going to be helpful later on.

00:36:33.847 --> 00:36:39.444
So given that what we'll do is we'll write
it out into these three separate terms.

00:36:39.444 --> 00:36:44.703
And you can work out this math later on by yourself
but it's essentially just using logarithm rules

00:36:44.703 --> 00:36:54.728
taking all of these terms that we had in the line above and just separating
it out into these three different terms that will have nice meanings.

00:36:56.431 --> 00:37:02.754
Right, so if we look at this, the first term that we get
separated out is the expectation

00:37:02.754 --> 00:37:07.210
of log of P of X given Z, and then we're
going to have two KL terms, right.

00:37:07.210 --> 00:37:14.400
This is basically a KL divergence term saying
how close these two distributions are.

00:37:14.400 --> 00:37:18.567
So how close is a distribution
Q of Z given X to P of Z.

00:37:19.489 --> 00:37:24.287
So it's just the, it's exactly
this expectation term above.

00:37:24.287 --> 00:37:28.454
And it's just a distance
metric for distributions.

00:37:30.908 --> 00:37:36.183
And so we'll see that, right, we saw that these
are nice KL terms that we can write out.

00:37:36.183 --> 00:37:39.290
And now if we look at these
three terms that we have here,

00:37:39.290 --> 00:37:45.819
the first term is P of X given Z, which
is provided by our decoder network.

00:37:45.819 --> 00:37:52.042
And we're able to compute an estimate of this
term through sampling, and we'll see that we can

00:37:52.042 --> 00:37:56.099
do sampling that's differentiable through something
called the re-parametrization trick which is a

00:37:56.099 --> 00:37:59.920
detail that you can look at this
paper if you're interested.

00:37:59.920 --> 00:38:02.479
But basically we can
now compute this term.
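A minimal sketch of that re-parametrization trick: instead of sampling z directly from N(mu, sigma^2), we sample eps from N(0, I) and set z = mu + sigma * eps, so z becomes a deterministic (hence differentiable) function of the encoder outputs mu and sigma. The values of mu and sigma here are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])       # hypothetical encoder outputs
sigma = np.array([0.5, 0.1])

# Re-parametrization: push all the randomness into eps ~ N(0, I).
eps = rng.standard_normal((100_000, 2))
z = mu + sigma * eps             # z ~ N(mu, sigma^2), differentiable in mu and sigma

# Sanity check: the samples have the intended moments.
sample_mean = z.mean(axis=0)     # approx mu
sample_std = z.std(axis=0)       # approx sigma
```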

00:38:02.479 --> 00:38:08.600
And then these KL terms, the second KL
term is a KL between two Gaussians,

00:38:08.600 --> 00:38:16.079
so our Q of Z given X, remember our encoder produced this distribution
which had a mean and a covariance, it was a nice Gaussian.

00:38:16.079 --> 00:38:19.892
And then also our prior P of
Z which is also a Gaussian.

00:38:19.892 --> 00:38:25.628
And so this has a nice, when you have a KL of two Gaussians
you have a nice closed form solution that you can have.
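For reference, the closed form being mentioned here, for q = N(mu, diag(sigma^2)) against the standard normal prior p(z) = N(0, I), is KL = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2). Below is a sketch checking it against a Monte Carlo estimate; the mu and logvar values are arbitrary.

```python
import numpy as np

def kl_diag_gaussian_vs_standard(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

mu = np.array([0.5, -1.0])
logvar = np.array([0.0, -0.5])
kl = kl_diag_gaussian_vs_standard(mu, logvar)

# Monte Carlo estimate of E_q[log q(z) - log p(z)] for comparison.
rng = np.random.default_rng(0)
sigma = np.exp(0.5 * logvar)
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * (((z - mu) / sigma) ** 2 + np.log(2 * np.pi) + logvar).sum(axis=1)
log_p = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=1)
kl_mc = (log_q - log_p).mean()
```

When q exactly matches the prior (mu = 0, logvar = 0), the closed form gives exactly zero, consistent with KL being a non-negative divergence.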

00:38:25.628 --> 00:38:31.324
And then this third KL term now, this is
a KL of Q of Z given X with P of Z given X.

00:38:32.303 --> 00:38:36.766
But we know that P of Z given X was this
intractable posterior that we saw earlier, right?

00:38:36.766 --> 00:38:41.794
That we didn't want to compute that's
why we had this approximation using Q.

00:38:41.794 --> 00:38:44.625
And so this term is still a problem.

00:38:44.625 --> 00:38:54.776
But one thing we do know about this term is that KL divergence, which is a distance
between two distributions, is always greater than or equal to zero by definition.

00:38:57.060 --> 00:39:03.396
And so what we can do with this is that, well what we have
here, the two terms that we can work nicely with, this is a,

00:39:03.396 --> 00:39:10.023
this is a tractable lower bound which we can
actually take gradient of and optimize.

00:39:10.023 --> 00:39:16.652
P of X given Z is differentiable and the KL terms are
also; the closed form solution is also differentiable.

00:39:16.652 --> 00:39:24.168
And this is a lower bound because we know that the KL term
on the right, the ugly one, is greater than or equal to zero.

00:39:24.168 --> 00:39:26.251
So we have a lower bound.

00:39:27.273 --> 00:39:37.699
And so what we'll do to train a variational autoencoder is that we take this
lower bound and we instead optimize and maximize this lower bound instead.

00:39:37.699 --> 00:39:42.251
So we're optimizing a lower bound
on the likelihood of our data.

00:39:42.251 --> 00:39:49.940
So that means that our data is always going to have a likelihood
that's at least as high as this lower bound that we're maximizing.

00:39:49.940 --> 00:39:58.941
And so we want to estimate the
parameters theta and phi that allow us to maximize this.

00:40:03.169 --> 00:40:06.412
And then one last sort of
intuition about this lower bound

00:40:06.412 --> 00:40:12.796
that we have is that this first term
is expectation over all samples of Z

00:40:12.796 --> 00:40:22.699
sampled from passing our X through the encoder network, and
taking the expectation over all of these samples of the likelihood of X given Z,

00:40:24.963 --> 00:40:26.854
and so this is a reconstruction, right?

00:40:26.854 --> 00:40:33.300
This is basically saying, if I want this to be big
I want this likelihood P of X given Z to be high,

00:40:33.300 --> 00:40:37.756
so it's kind of like trying to do a
good job reconstructing the data.

00:40:37.756 --> 00:40:40.528
So similar to what we had
from our autoencoder before.

00:40:40.528 --> 00:40:44.695
But the second term here is
saying make this KL small.

00:40:46.161 --> 00:40:51.283
Make our approximate posterior distribution
close to our prior distribution.

00:40:51.283 --> 00:41:04.558
And this is basically saying that we want our latent variable Z
to follow the distribution shape that we would like it to have.

00:41:08.974 --> 00:41:12.058
Okay so any questions about this?

00:41:12.058 --> 00:41:19.128
I think this is a lot of math that if you guys are interested you should
go back and kind of work through all of the derivations yourself.

00:41:19.128 --> 00:41:19.961
Yeah.

00:41:20.883 --> 00:41:23.669
[student's words obscured
due to lack of microphone]

00:41:23.669 --> 00:41:29.373
So the question is why do we specify the
prior and the latent variables as Gaussian?

00:41:29.373 --> 00:41:33.512
And the reason is that well we're defining
some sort of generative process right,

00:41:33.512 --> 00:41:35.930
of sampling Z first and
then sampling X.

00:41:35.930 --> 00:41:53.307
And defining it as a Gaussian is a reasonable type of prior that we can say makes sense for these types of latent
attributes to be distributed according to some sort of Gaussian, and then this lets us now then optimize our model.

00:41:55.988 --> 00:42:06.053
Okay, so we talked about how we can derive this lower bound, and now let's put
this all together and walk through the training process of the VAE.

00:42:06.053 --> 00:42:10.008
Right so here's the bound that we
want to optimize, to maximize.

00:42:10.008 --> 00:42:19.301
And now for a forward pass we're going to proceed in the following
manner. We have our input data X, so we'll take a mini batch of input data.

00:42:20.845 --> 00:42:26.544
And then we'll pass it through our encoder
network so we'll get Q of Z given X.

00:42:28.439 --> 00:42:35.805
And this Q of Z given X gives us the
terms that we use to compute the KL term.

00:42:35.805 --> 00:42:46.856
And then from here we'll sample Z from this distribution of Z given X
so we have a sample of the latent factors that we can infer from X.

00:42:50.721 --> 00:42:54.889
And then from here we're going to pass Z
through our second network, the decoder network.

00:42:54.889 --> 00:43:07.686
And from the decoder network we'll get this output for the mean and variance on our distribution
for X given Z and then finally we can sample now our X given Z from this distribution

00:43:07.686 --> 00:43:12.155
and here this will produce
some sample output.

00:43:12.155 --> 00:43:23.517
And when we're training we're going to take this distribution and say well
our loss term is going to be log of our training image pixel values given Z.

00:43:23.612 --> 00:43:30.684
So our loss function's going to say let's maximize the
likelihood of this original input being reconstructed.

00:43:32.020 --> 00:43:35.919
And so now for every mini batch of input
we're going to compute this forward pass.

00:43:35.919 --> 00:43:43.837
Get all these terms that we need, and then this is all differentiable
so then we just backprop through all of this and then get our gradient,

00:43:43.837 --> 00:43:57.040
we update our model and we use this to continuously update our parameters, our encoder and
decoder network parameters phi and theta, in order to maximize the likelihood of the training data.

00:43:58.408 --> 00:44:05.547
Okay so once we've trained our VAE, so now to generate data,
what we can do is we can use just the decoder network.

00:44:05.547 --> 00:44:15.504
All right, so from here we can sample Z now, instead of sampling Z from this posterior that
we had during training, during generation we sample from our true generative process.

00:44:15.504 --> 00:44:18.673
So we sample from our
prior that we specify.

00:44:18.673 --> 00:44:22.840
And then we're going to then
sample our data X from here.

00:44:25.281 --> 00:44:34.798
And we'll see that this can produce, in this case, train on MNIST,
these are samples of digits generated from a VAE trained on MNIST.

00:44:36.058 --> 00:44:43.796
And you can see that, you know, we talked about this
idea of Z representing these latent factors where we can

00:44:43.796 --> 00:44:52.625
vary Z, according to our samples from different parts of our prior,
and then get different kinds of interpretable meanings from here.

00:44:52.625 --> 00:44:57.142
So here we can see that this is
the data manifold for two dimensional Z.

00:44:57.142 --> 00:45:08.568
So if we have a two dimensional Z and we take Z and let's say some range from you
know, from different percentiles of the distribution, and we vary Z1 and we vary Z2,

00:45:08.568 --> 00:45:16.300
then you can see how the image generated from
every combination of Z1 and Z2 that we have here,

00:45:16.300 --> 00:45:22.087
you can see it's transitioning smoothly
across all of these different variations.
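One way such a manifold grid can be built (a sketch, not necessarily the exact procedure behind the figure) is to place z1 and z2 at evenly spaced percentiles of the N(0, 1) prior and decode every combination; the percentile range here is an assumption.

```python
import numpy as np
from statistics import NormalDist

# Take z1 and z2 at evenly spaced percentiles of the N(0, 1) prior
# (5th through 95th), so the grid covers the bulk of the prior's mass.
percentiles = np.linspace(0.05, 0.95, 7)
ticks = [NormalDist().inv_cdf(p) for p in percentiles]

# Grid of 2-D latent codes; each row would be fed to the decoder
# to generate one image of the manifold figure.
grid = np.array([[z1, z2] for z1 in ticks for z2 in ticks])
```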

00:45:24.051 --> 00:45:27.808
And you know, our prior on
Z was diagonal,

00:45:27.808 --> 00:45:43.006
so we chose this in order to encourage this to be independent latent variables that can then encode interpretable factors of
variation. So because of this now we'll have different dimensions of Z, encoding different interpretable factors of variation.

00:45:44.477 --> 00:45:54.771
So, in this example, trained now on faces, we'll see as we vary Z1,
going up and down, you'll see the amount of smile changing.

00:45:54.771 --> 00:46:00.225
So from a frown at the top to like a big smile
at the bottom and then as we go vary Z2,

00:46:01.997 --> 00:46:07.859
from left to right, you can see the head pose changing.
From one direction all the way to the other.

00:46:09.883 --> 00:46:18.526
And so one additional thing I want to point out is that as a result of
doing this, these Z variables are also good feature representations.

00:46:19.510 --> 00:46:26.376
Because they encode how much of these
different interpretable semantics that we have.

00:46:26.376 --> 00:46:32.296
And so we can use our Q of Z given X, the
encoder that we've learned and give it an input

00:46:32.296 --> 00:46:42.249
image X, we can map this to Z and use the Z as features for
downstream supervised tasks like classification or other tasks.

00:46:47.348 --> 00:46:51.434
Okay so just another couple of
examples of data generated from VAEs.

00:46:51.434 --> 00:47:02.231
So on the left here we have data generated on CIFAR-10, trained on CIFAR-10,
and then on the right we have data trained and generated on Faces.

00:47:02.231 --> 00:47:08.737
And so we can see that in general
VAEs are able to generate recognizable data.

00:47:08.737 --> 00:47:15.493
One of the main drawbacks of VAEs is that they tend
to still have a bit of a blurry aspect to them.

00:47:15.493 --> 00:47:20.520
You can see this in the faces and so this
is still an active area of research.

00:47:22.008 --> 00:47:28.030
Okay so to summarize VAEs, they're a
probabilistic spin on traditional autoencoders.

00:47:28.030 --> 00:47:36.077
So instead of deterministically taking your input X and
going to Z, feature Z and then back to reconstructing X,

00:47:36.077 --> 00:47:43.023
now we have this idea of distributions and sampling
involved which allows us to generate data.

00:47:43.023 --> 00:47:51.101
And in order to train this, VAEs are defining an intractable
density. So we can derive and optimize a lower bound,

00:47:51.101 --> 00:47:59.718
a variational lower bound, so variational means basically using
approximations to handle these types of intractable expressions.

00:47:59.718 --> 00:48:03.577
And so this is why this is
called a variational autoencoder.

00:48:03.577 --> 00:48:10.249
And so some of the advantages of this approach
is that VAEs are, they're a principled approach

00:48:10.249 --> 00:48:17.628
to generative models and they also allow this inference
query so being able to infer things like Q of Z given X.

00:48:17.628 --> 00:48:21.554
That we said could be useful feature
representations for other tasks.

00:48:23.101 --> 00:48:29.548
So disadvantages of VAEs are that while we're maximizing
the lower bound of the likelihood, which is okay

00:48:29.548 --> 00:48:37.782
like you know in general this is still pushing us in the right
direction, and there's other theoretical analysis of this.

00:48:37.782 --> 00:48:48.378
So you know, it's doing okay, but it's maybe not as direct an
optimization and evaluation as the pixelRNNs and CNNs that we saw earlier.

00:48:48.378 --> 00:49:03.348
And also, VAE samples tend to be a little bit blurrier and of lower quality compared
to state of the art samples from other generative models such as GANs, which we'll talk about next.

00:49:04.827 --> 00:49:08.647
And so VAEs now are still, they're
still an active area of research.

00:49:11.044 --> 00:49:13.447
People are working on more
flexible approximations,

00:49:13.447 --> 00:49:20.881
so richer approximate posteriors, so instead of just
a diagonal Gaussian some richer functions for this.

00:49:20.881 --> 00:49:26.992
And then also, another area that people have been working on
is incorporating more structure in these latent variables.

00:49:26.992 --> 00:49:31.282
So now we had all of these
independent latent variables

00:49:31.282 --> 00:49:38.077
but people are working on modeling structure
in here: groupings, other types of structure.

00:49:41.106 --> 00:49:43.106
Okay, so yeah, question.

00:49:44.404 --> 00:49:47.529
[student's words obscured
due to lack of microphone]

00:49:47.529 --> 00:49:51.394
Yeah, so the question is how we're deciding the
dimensionality of the latent variable.

00:49:51.394 --> 00:49:54.727
Yeah, that's something that you specify.

00:49:55.874 --> 00:50:07.481
Okay, so we've talked so far about pixelCNNs and VAEs and now we'll take
a look at a third and very popular type of generative model called GANs.

00:50:10.019 --> 00:50:15.713
So the models that we've seen so far, pixelCNNs
and RNNs define a tractable density function.

00:50:15.713 --> 00:50:19.752
And they optimize the
likelihood of the training data.

00:50:19.752 --> 00:50:27.752
And then VAEs in contrast to that now have this additional
latent variable Z that they define in the generative process.

00:50:27.752 --> 00:50:36.858
And so having the Z has a lot of nice properties that we talked about, but
it also causes us to have this intractable density function that we can't

00:50:36.858 --> 00:50:43.934
optimize directly and so we derive and optimize
a lower bound on the likelihood instead.

00:50:43.934 --> 00:50:48.486
And so now what if we just give up on
explicitly modeling this density at all?

00:50:48.486 --> 00:50:55.267
And we say well what we want is just the ability to
sample and to have nice samples from our distribution.

00:50:56.501 --> 00:50:59.175
So this is the approach that GANs take.

00:50:59.175 --> 00:51:02.637
So in GANs we don't work with
an explicit density function,

00:51:02.637 --> 00:51:05.642
but instead we're going to
take a game-theoretic approach

00:51:05.642 --> 00:51:13.839
and we're going to learn to generate from our training distribution through
a set up of a two player game, and we'll talk about this in more detail.

00:51:15.255 --> 00:51:24.681
So, in the GAN set up we're saying, okay well what we want, what we care about is
we want to be able to sample from a complex high dimensional training distribution.

00:51:24.681 --> 00:51:31.170
So if we think about well we want to produce samples from
this distribution, there's no direct way that we can do this.

00:51:31.170 --> 00:51:35.078
We have this very complex distribution,
we can't just take samples from here.

00:51:35.078 --> 00:51:46.875
So the solution that we're going to take is that we can, however, sample from simpler
distributions. For example random noise, right? Gaussians are, these we can sample from.

00:51:46.875 --> 00:51:56.789
And so what we're going to do is we're going to learn a transformation from
these simple distributions directly to the training distribution that we want.

00:51:58.790 --> 00:52:04.304
So the question is, what can we use to
represent this complex transformation?

00:52:06.120 --> 00:52:07.718
Neural network, I heard the answer.

00:52:07.718 --> 00:52:14.373
So when we want to model some kind of complex
function or transformation we use a neural network.

00:52:14.373 --> 00:52:23.297
Okay so what we're going to do is we're going to take in the GAN set up, we're
going to take some input which is a vector of some dimension that we specify

00:52:23.297 --> 00:52:33.628
of random noise and then we're going to pass this through a generator network, and
then we're going to get as output directly a sample from the training distribution.

00:52:33.628 --> 00:52:40.154
So every input of random noise we want to correspond
to a sample from the training distribution.

00:52:41.278 --> 00:52:48.737
And so the way we're going to train and learn this network
is that we're going to look at this as a two player game.

00:52:48.737 --> 00:52:54.595
So we have two players, a generator network as well as
an additional discriminator network that I'll show next.

00:52:54.595 --> 00:53:04.320
And our generator network is going to try to, as player one, it's going
to try to fool the discriminator by generating real looking images.

00:53:04.320 --> 00:53:12.462
And then our second player, our discriminator network is then
going to try to distinguish between real and fake images.

00:53:12.462 --> 00:53:23.323
So it wants to do as good a job as possible of trying to determine which of
these images are counterfeit or fake images generated by this generator.

00:53:25.425 --> 00:53:27.324
Okay so what this looks like is,

00:53:27.324 --> 00:53:31.203
we have our random noise going
to our generator network,

00:53:31.203 --> 00:53:36.121
our generator network is generating these images that
we're going to call fake images.

00:53:36.121 --> 00:53:42.439
And then we're going to also have real images that
we take from our training set and then we want the

00:53:42.439 --> 00:53:50.881
discriminator to be able to distinguish
between real and fake images.

00:53:50.881 --> 00:53:52.849
Outputting real or fake for each image.

00:53:52.849 --> 00:54:01.638
So the idea is if we're able to have a very good discriminator, we want to train a
good discriminator, if it can do a good job of discriminating real versus fake,

00:54:01.638 --> 00:54:11.140
and then if our generator network is able to generate, if it's able to do
well and generate fake images that can successfully fool this discriminator,

00:54:11.140 --> 00:54:13.135
then we have a good generative model.

00:54:13.135 --> 00:54:17.431
We're generating images that look
like images from the training set.

00:54:19.482 --> 00:54:25.548
Okay, so we have these two players and so we're going
to train this jointly in a minimax game formulation.

00:54:25.548 --> 00:54:28.941
So this minimax objective
function is what we have here.

00:54:28.941 --> 00:54:37.399
We're going to take, it's going to be minimum over
theta G our parameters of our generator network G,

00:54:37.399 --> 00:54:44.848
and maximum over theta D, the parameters of our discriminator
network D, of this objective, right, these terms.

00:54:47.177 --> 00:54:49.624
And so if we look at these
terms, what this is saying

00:54:49.624 --> 00:54:54.910
is well this first thing, expectation
over the data of log of D of X.

00:54:56.094 --> 00:55:01.151
This log of D of X is the
discriminator output for real data X.

00:55:01.151 --> 00:55:09.309
This is going to be likelihood of real data
being real from the data distribution P data.

00:55:09.309 --> 00:55:16.882
And then the second term here, expectation over Z drawn
from P of Z, means sampling noise Z and passing it through

00:55:16.882 --> 00:55:27.577
our generator network and this term D of G of Z that we have here
is the output of our discriminator for generated fake data for our,

00:55:29.109 --> 00:55:33.769
what does the discriminator output
of G of Z which is our fake data.

00:55:36.311 --> 00:55:43.105
And so if we think about this is trying to do, our
discriminator wants to maximize this objective, right,

00:55:43.105 --> 00:55:53.278
it's a max over theta D such that D of X is close to
one. It's close to real, it's high for the real data.

00:55:53.278 --> 00:56:02.679
And then D of G of Z, what it thinks of the fake data
here, is small, we want this to be close to zero.

00:56:02.679 --> 00:56:09.237
So if we're able to maximize this, this means the discriminator
is doing a good job of distinguishing between real and fake.

00:56:09.237 --> 00:56:13.449
Basically classifying
between real and fake data.

00:56:13.449 --> 00:56:22.375
And then our generator, here we want the generator to minimize
this objective such that D of G of Z is close to one.

00:56:22.375 --> 00:56:35.236
So if this D of G of Z is close to one over here, then the one minus term is
small, and so basically if we minimize this term, then we're having the

00:56:36.768 --> 00:56:39.175
discriminator think that our
fake data's actually real.

00:56:39.175 --> 00:56:44.087
So that means that our generator
is producing real samples.

00:56:44.087 --> 00:56:51.139
Okay so this is the important objective of GANs to try
and understand so are there any questions about this?
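To make the objective concrete, here's a tiny sketch in Python of the value the two players are fighting over, using hypothetical discriminator output numbers rather than a real network:

```python
import math

def minimax_value(d_real, d_fake):
    """The quantity D maximizes and G minimizes:
    E[log D(x)] + E[log(1 - D(G(z)))], averaged over a batch.
    d_real: discriminator outputs on real images; d_fake: on generated ones."""
    term_real = sum(math.log(d) for d in d_real) / len(d_real)
    term_fake = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return term_real + term_fake

# A confident, correct discriminator (real ~1, fake ~0) pushes the
# value near its maximum of 0.
strong_d = minimax_value([0.99, 0.98], [0.01, 0.02])
# A discriminator that can't tell real from fake sits much lower.
confused_d = minimax_value([0.5, 0.5], [0.5, 0.5])
```

The generator's only lever is the second term: pushing D(G(z)) toward one makes the value more negative, which is exactly its minimization goal.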

00:56:51.139 --> 00:57:01.360
[student's words obscured due to lack of microphone] I'm not sure I understand
your question, can you, [student's words obscured due to lack of microphone]

00:57:12.334 --> 00:57:23.067
Yeah, so the question is is this basically trying to have the first network produce real
looking images that our second network, the discriminator cannot distinguish between.

00:57:30.474 --> 00:57:36.809
Okay, so the question is how do we actually label
the data or do the training for these networks.

00:57:36.809 --> 00:57:46.180
We'll see how to train the networks next. But in terms of like what is the
data label basically, this is unsupervised, so there's no data labeling.

00:57:46.180 --> 00:57:52.805
But data generated from the generator network, the
fake images have a label of basically zero or fake.

00:57:52.805 --> 00:58:00.344
And we can take training images that are real images
and this basically has a label of one or real.

00:58:00.344 --> 00:58:04.866
So when we have, the loss function
for our discriminator is using this.

00:58:04.866 --> 00:58:09.819
It's trying to output a zero for the generator
images and a one for the real images.

00:58:09.819 --> 00:58:12.048
So there's no external labels.

00:58:12.048 --> 00:58:15.136
[student's words obscured
due to lack of microphone]

00:58:15.136 --> 00:58:22.119
So the question is whether the label for the generator network
will be the output of the discriminator network.

00:58:22.119 --> 00:58:29.321
The generator is not really doing, it's not
really doing classifications necessarily.

00:58:29.321 --> 00:58:35.536
What its objective is, is here: D of
G of Z, it wants this to be high.

00:58:35.536 --> 00:58:42.487
So given a fixed discriminator, it wants to learn
the generator parameter such that this is high.

00:58:42.487 --> 00:58:47.752
So we'll take the fixed discriminator
output and use that to do the backprop.

00:58:51.447 --> 00:58:54.219
Okay so in order to train
this, what we're going to do

00:58:54.219 --> 00:58:57.714
is we're going to alternate
between gradient ascent

00:58:57.714 --> 00:59:05.222
on our discriminator, so we're trying to learn
theta D to maximize this objective.

00:59:05.222 --> 00:59:08.059
And then gradient
descent on the generator.

00:59:08.059 --> 00:59:15.698
So taking gradient descent on these parameters theta G
such that we're minimizing this objective.

00:59:15.698 --> 00:59:23.748
And here we are only taking this right part over here because
that's the only part that's dependent on theta G parameters.
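This alternation can be sketched as a framework-agnostic update step; the gradient functions here are hypothetical stand-ins for backprop through each network:

```python
def alternating_update(theta_d, theta_g, grad_d, grad_g, lr=0.01):
    """One round of the two-player update.
    grad_d: gradient of the objective w.r.t. theta_D (we ASCEND on it);
    grad_g: gradient of the generator's term w.r.t. theta_G (we DESCEND)."""
    # Gradient ascent on the discriminator parameters theta_D.
    theta_d = [p + lr * g for p, g in zip(theta_d, grad_d(theta_d, theta_g))]
    # Gradient descent on the generator parameters theta_G; only the
    # log(1 - D(G(z))) term depends on theta_G, so only it is differentiated.
    theta_g = [p - lr * g for p, g in zip(theta_g, grad_g(theta_d, theta_g))]
    return theta_d, theta_g
```

The sign flip between the two lines is the whole minimax structure: the same objective goes up for one player and down for the other.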

00:59:26.574 --> 00:59:30.603
Okay so this is how we can train this GAN.

00:59:30.603 --> 00:59:35.716
We can alternate between training our discriminator
and our generator in this game, each trying to fool

00:59:35.716 --> 00:59:40.561
the other or generator trying
to fool the discriminator.

00:59:40.561 --> 00:59:50.478
But one thing that is important to note is that in practice this generator
objective as we've just defined actually doesn't work that well.

00:59:50.478 --> 00:59:55.309
And the reason for this is we have
to look at the loss landscape.

00:59:55.309 --> 01:00:01.059
So if we look at the loss landscape
over here for D of G of Z,

01:00:02.858 --> 01:00:10.654
if we plot here one minus D of G of Z, which is what we
want to minimize for the generator, it has this shape here.

01:00:12.748 --> 01:00:21.119
So we want to minimize this and it turns out the slope of
this loss is actually going to be higher towards the right.

01:00:21.119 --> 01:00:24.369
High when D of G of Z is closer to one.

01:00:26.915 --> 01:00:36.837
So that means that when our generator is doing a good job of fooling the
discriminator, we're going to have a high gradient there.

01:00:36.837 --> 01:00:44.794
And on the other hand when we have bad samples, our generator has
not learned a good job yet, it's not good at generating yet,

01:00:44.794 --> 01:00:52.159
then this is when the discriminator can easily tell
it's now closer to this zero region on the X axis.

01:00:53.002 --> 01:00:55.482
Then here the gradient's relatively flat.

01:00:55.482 --> 01:01:03.977
And so what this actually means is that our gradient signal
is dominated by the region where the sample is already pretty good.

01:01:05.200 --> 01:01:12.624
Whereas we actually want it to learn a lot when the samples are
bad, right? These are training samples that we want to learn from.

01:01:12.624 --> 01:01:21.664
And so in order to, so this basically makes it
hard to learn and so in order to improve learning,

01:01:21.664 --> 01:01:26.320
what we're going to do is define a slightly
different objective function for the generator.

01:01:26.320 --> 01:01:30.145
Where now we're going to
do gradient ascent instead.

01:01:30.145 --> 01:01:35.748
And so instead of minimizing the likelihood of our
discriminator being correct, which is what we had earlier,

01:01:35.748 --> 01:01:40.908
now we'll kind of flip it and say let's maximize
the likelihood of our discriminator being wrong.

01:01:40.908 --> 01:01:49.720
And so this will produce this objective here
of maximizing log of D of G of Z.

01:01:50.767 --> 01:01:55.102
And so, now basically we want to,
there should be a negative sign here.

01:01:59.160 --> 01:02:08.659
But basically we want to now maximize this flipped objective
instead, and what this now does is if we plot this function

01:02:10.118 --> 01:02:16.149
on the right here, then we have a high gradient signal
in this region on the left where we have bad samples,

01:02:16.149 --> 01:02:23.242
and now the flatter region is to the
right where we would have good samples.

01:02:23.242 --> 01:02:26.571
So now we're going to learn more
from regions of bad samples.

01:02:26.571 --> 01:02:35.990
And so this has the same objective of fooling the discriminator but it
actually works much better in practice and for a lot of work on GANs that are

01:02:35.990 --> 01:02:41.492
using these kind of vanilla GAN formulation
is actually using this objective.
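The difference between the two generator objectives is easy to check numerically; a small sketch comparing the slope of each loss as a function of the discriminator's output on a fake sample:

```python
def saturating_slope(d_gz):
    # Slope of log(1 - D(G(z))), the original objective the generator minimizes.
    return -1.0 / (1.0 - d_gz)

def nonsaturating_slope(d_gz):
    # Slope of log(D(G(z))), the flipped objective the generator maximizes.
    return 1.0 / d_gz

# Bad samples: the discriminator easily says fake, so D(G(z)) is near zero.
bad = 0.01
# Good samples: the discriminator is fooled, so D(G(z)) is near one.
good = 0.99
```

With the original loss, almost all gradient lives where samples are already good; the flipped loss puts the strong gradient where samples are bad, which is where we want to learn.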

01:02:44.220 --> 01:02:59.079
Okay so just an aside on that is that jointly training these two networks is challenging and can be unstable.
So as we saw here, like we're alternating between training a discriminator and training a generator.

01:02:59.079 --> 01:03:08.398
This type of alternation is, basically it's hard to
learn two networks at once and there's also this issue

01:03:08.398 --> 01:03:13.815
of depending on what our loss landscape looks
like, it can affect our training dynamics.

01:03:13.815 --> 01:03:23.342
So an active area of research still is how can we choose objectives with
better loss landscapes that can help training and make it more stable?

01:03:26.516 --> 01:03:31.152
Okay so now let's put this all together and
look at the full GAN training algorithm.

01:03:31.152 --> 01:03:34.366
So what we're going to do is
for each iteration of training

01:03:34.366 --> 01:03:41.078
we're going to first train the discriminator
network a bit and then train the generator network.

01:03:41.078 --> 01:03:43.959
So for k steps of training
the discriminator network

01:03:43.959 --> 01:03:55.859
we'll sample a mini batch of noise samples from our noise prior Z and
then also sample a mini batch of real samples from our training data X.

01:03:57.366 --> 01:04:04.519
So what we'll do is we'll pass the noise through
our generator, we'll get our fake images out.

01:04:04.519 --> 01:04:08.052
So we have a mini batch of fake
images and mini batch of real images.

01:04:08.052 --> 01:04:15.041
And then we'll take a gradient step on the discriminator
using this mini batch, our fake and our real images,

01:04:15.041 --> 01:04:17.891
and then update our
discriminator parameters.

01:04:17.891 --> 01:04:24.313
And use this and do this a certain number of iterations
to train the discriminator for a bit basically.

01:04:24.313 --> 01:04:28.803
And then after that we'll go to our second
step which is training the generator.

01:04:28.803 --> 01:04:32.544
And so here we'll sample just
a mini batch of noise samples.

01:04:32.544 --> 01:04:43.102
We'll pass this through our generator and then now we want to do backprop
on this to basically optimize our generator objective that we saw earlier.

01:04:45.078 --> 01:04:49.705
So we want to have our generator fool
our discriminator as much as possible.

01:04:50.773 --> 01:04:58.895
And so we're going to alternate between these two steps of taking
gradient steps for our discriminator and for the generator.
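Putting the algorithm together, here's a structural sketch of the loop; the sampling and update functions are hypothetical placeholders for real networks and data:

```python
def train_gan(sample_noise_batch, sample_real_batch,
              discriminator_step, generator_step, iters=1000, k=1):
    """GAN training loop as described in the lecture: each iteration does
    k discriminator updates, then one generator update."""
    for _ in range(iters):
        for _ in range(k):
            z = sample_noise_batch()   # minibatch from the noise prior
            x = sample_real_batch()    # minibatch of real training images
            discriminator_step(z, x)   # gradient ascent on theta_D
        z = sample_noise_batch()
        generator_step(z)              # update theta_G to fool D
```

Note that k is exactly the knob debated in the lecture: how many discriminator steps to take per generator step.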

01:04:59.996 --> 01:05:07.709
And I said for k steps up here, for training the
discriminator and so this is kind of a topic of debate.

01:05:08.604 --> 01:05:15.391
Some people think just having one step of the discriminator
and one step of the generator per iteration is best.

01:05:15.391 --> 01:05:20.744
Some people think it's better to train the discriminator
for a little bit longer before switching to the generator.

01:05:20.744 --> 01:05:30.732
There's no real clear rule and it's something that people have
found different things to work better depending on the problem.

01:05:30.732 --> 01:05:45.028
And one thing I want to point out is that there's been a lot of recent work that alleviates this problem and
makes it so you don't have to spend so much effort trying to balance the training of these two networks.

01:05:45.028 --> 01:05:47.880
It'll have more stable training
and give better results.

01:05:47.880 --> 01:05:55.655
And so Wasserstein GAN is an example of a paper
that was an important work towards doing this.

01:06:00.313 --> 01:06:09.767
Okay so looking at the whole picture we've now trained, we have our network
setup, we've trained both our generator network and our discriminator network

01:06:09.767 --> 01:06:16.899
and now after training, for generation we can just take our
generator network and use it to generate new images.

01:06:16.899 --> 01:06:21.520
So we just take noise Z and pass this
through and generate fake images from here.

01:06:23.636 --> 01:06:28.351
Okay and so now let's look at some
generated samples from these GANs.

01:06:28.351 --> 01:06:33.099
So here's an example of trained on MNIST
and then on the right on Faces.

01:06:33.099 --> 01:06:43.849
And for each of these you can also see, just for visualization, in the rightmost
column the nearest neighbor from the training set to the column right next to it.

01:06:43.849 --> 01:06:49.227
And so you can see that we're able to generate very realistic
samples and it never directly memorizes the training set.

01:06:51.264 --> 01:06:56.061
And here are some examples from the
original GAN paper on CIFAR images.

01:06:56.061 --> 01:07:07.374
And these are still not such good quality yet. The
original work is from 2014, so these are some older, simpler networks.

01:07:07.374 --> 01:07:11.541
And these were using simple,
fully connected networks.

01:07:12.550 --> 01:07:16.018
And so since that time there's been
a lot of work on improving GANs.

01:07:18.120 --> 01:07:31.388
One example of a work that really took a big step towards improving the quality of samples
is this work from Alec Radford in ICLR 2016 on adding convolutional architectures to GANs.

01:07:33.806 --> 01:07:42.958
In this paper there was a whole set of guidelines on
architectures for helping GANs to produce better samples.

01:07:42.958 --> 01:07:46.517
So you can look at this for more details.

01:07:46.517 --> 01:07:52.669
This is an example of a convolutional architecture
that they're using which is going from our input Z

01:07:52.669 --> 01:07:57.694
noise vector Z and transforming this
all the way to the output sample.
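As a rough sketch of how such a convolutional generator grows a tiny spatial map into a full image, here is the output-size arithmetic for a stride-2 transposed convolution; the kernel-4, pad-1 configuration is a common choice assumed here, not taken from the paper:

```python
def transposed_conv_size(size, stride=2, kernel=4, pad=1):
    # Spatial output size of a transposed (fractionally strided) convolution.
    return (size - 1) * stride - 2 * pad + kernel

# Project the noise vector z to a small 4x4 feature map,
# then repeatedly upsample with stride-2 transposed convolutions.
sizes = [4]
for _ in range(4):
    sizes.append(transposed_conv_size(sizes[-1]))
# Each layer doubles the spatial resolution: 4 -> 8 -> 16 -> 32 -> 64.
```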

01:08:00.527 --> 01:08:08.251
So now from this large convolutional architecture we'll see that
the samples from this model are really starting to look very good.

01:08:08.251 --> 01:08:11.408
So this is trained on
a dataset of bedrooms

01:08:11.408 --> 01:08:15.575
and we can see all kinds of
very realistic fancy looking

01:08:16.783 --> 01:08:26.063
bedrooms with windows and night stands and other furniture
around there so these are some really pretty samples.

01:08:26.064 --> 01:08:32.346
And we can also try and interpret a
little bit of what these GANs are doing.

01:08:32.346 --> 01:08:42.817
So in this example here what we can do is we can take two points of Z, two
different random noise vectors and let's just interpolate between these points.

01:08:42.818 --> 01:08:50.142
And each row across here is an interpolation from
one random noise Z to another random noise vector Z

01:08:50.142 --> 01:08:57.072
and you can see that as it's changing, it's smoothly
interpolating the image as well all the way over.
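The interpolation itself is just a straight line in latent space; a minimal sketch, where each intermediate z would then be passed through the trained generator:

```python
def interpolate_z(z1, z2, steps=8):
    """Linearly interpolate between two latent noise vectors,
    returning `steps` points from z1 to z2 inclusive."""
    return [[(1 - t) * a + t * b for a, b in zip(z1, z2)]
            for t in (i / (steps - 1) for i in range(steps))]
```

Decoding each point in the returned list is what produces the smooth image morph in each row of the figure.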

01:08:59.286 --> 01:09:02.067
And so something else that
we can do is we can see that,

01:09:02.067 --> 01:09:10.313
well, let's try to analyze further what these vectors
Z mean, and so we can try and do vector math on here.

01:09:10.313 --> 01:09:17.828
So what this experiment does is it says
okay, let's take some images of smiling,

01:09:17.828 --> 01:09:26.628
some samples of smiling women, and then let's take some samples
of neutral women and then also some samples of neutral men.

01:09:28.341 --> 01:09:34.920
And so let's take the average of the Z
vectors that produced each of these samples and if we,

01:09:34.920 --> 01:09:45.037
Say we take this mean vector for the smiling women, subtract the mean vector for
the neutral women and add the mean vector for the neutral men, what do we get?

01:09:46.651 --> 01:09:49.884
And we get samples of smiling men.

01:09:49.884 --> 01:09:56.200
So we can take the Z vector produced there,
generate samples and get samples of smiling men.

01:09:57.190 --> 01:10:03.879
And we can have another example of this: glasses
man minus no glasses man plus no glasses woman.

01:10:05.918 --> 01:10:08.763
And get women with glasses.

01:10:08.763 --> 01:10:18.358
So here you can see that basically the Z has this type of interpretability
that you can use this to generate some pretty cool examples.
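The arithmetic is done on averages of latent vectors, not on pixels; here is a sketch with hypothetical two-dimensional z vectors standing in for the real latent codes:

```python
def mean_z(zs):
    """Average a list of latent vectors component-wise."""
    return [sum(col) / len(zs) for col in zip(*zs)]

def attribute_shift(smiling_women, neutral_women, neutral_men):
    """mean(smiling women) - mean(neutral women) + mean(neutral men);
    decoding the resulting z is what yields samples of smiling men."""
    sw, nw, nm = mean_z(smiling_women), mean_z(neutral_women), mean_z(neutral_men)
    return [a - b + c for a, b, c in zip(sw, nw, nm)]
```

The subtraction isolates the "smiling" direction in latent space, and adding it to the "neutral man" mean moves that sample along it.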

01:10:20.026 --> 01:10:23.967
Okay so this year, 2017 has really been
the year of the GAN.

01:10:24.842 --> 01:10:33.261
There's been tons and tons of work on GANs and it's really
sort of exploded and gotten some really cool results.

01:10:33.261 --> 01:10:38.680
So on the left here you can see people
working on better training and generation.

01:10:38.680 --> 01:10:45.621
So we talked about improving the loss functions, more
stable training and this was able to get really nice

01:10:47.216 --> 01:10:50.173
generations here of different
types of architectures

01:10:50.173 --> 01:10:54.326
on the bottom here really
crisp high resolution faces.

01:10:54.326 --> 01:11:01.742
With GANs there's also been work on models for
source-to-target domain transfer and conditional GANs.

01:11:01.742 --> 01:11:08.363
And so here, this is an example of source-to-target
domain transfer where, for example in the upper part

01:11:08.363 --> 01:11:14.703
here we are trying to go from source domain
of horses to an output domain of zebras.

01:11:14.703 --> 01:11:25.813
So we can take an image of horses and train a GAN such that the output is going
to be the same thing but now zebras in the same image setting as the horses

01:11:28.408 --> 01:11:33.124
and go the other way around.
We can transform apples into oranges.

01:11:33.124 --> 01:11:38.608
And also the other way around. We can
also use this to do photo enhancement.

01:11:38.608 --> 01:11:52.379
So taking a standard photo and trying to make it look really nice, as if you had
a really nice, expensive camera, so that you can get the nice blur effects.

01:11:52.379 --> 01:12:03.750
On the bottom here we have scene changing, so transforming an image of
Yosemite from the image in winter time to the image in summer time.

01:12:03.750 --> 01:12:05.753
And there's really tons of applications.

01:12:05.753 --> 01:12:16.373
So on the right here there's more. There's also going from a text description and
having a GAN that's now conditioned on this text description and producing an image.

01:12:18.343 --> 01:12:26.421
So there's something here about a small bird with a pink breast
and crown and now we're going to generate images of this.

01:12:26.421 --> 01:12:37.383
And there's also examples down here of filling in edges. So given conditions on some
sketch that we have, can we fill in a color version of what this would look like.

01:12:40.848 --> 01:12:50.416
Can we take a map grid and turn it into something
that looks like Google Earth?

01:12:52.528 --> 01:12:56.767
Go in and hallucinate all of these
buildings and trees and so on.

01:12:56.767 --> 01:13:07.061
And so there's lots of really cool examples of this. And there's also this website
for pix2pix, which did a lot of these kinds of conditional GAN type examples.

01:13:08.077 --> 01:13:17.549
I encourage you to go look at for more interesting
applications that people have done with GANs.

01:13:17.549 --> 01:13:24.640
And in terms of research papers, there's also
a huge number of papers about GANs this year now.

01:13:26.047 --> 01:13:31.365
There's a website called the GAN Zoo that kind
of is trying to compile a whole list of these.

01:13:31.365 --> 01:13:44.794
And so here this has only taken me from A through C on the left here and through like L on the right. So
it won't even fit on the slide. There's tons of papers as well that you can look at if you're interested.

01:13:44.794 --> 01:13:57.376
And then one last pointer is also for tips and tricks for training GANs, here's a nice
little website that has pointers if you're trying to train these GANs in practice.

01:14:01.313 --> 01:14:06.915
Okay, so summary of GANs. GANs don't
work with an explicit density function.

01:14:06.915 --> 01:14:13.989
Instead we're going to represent this implicitly through
samples and they take a game-theoretic approach to training

01:14:13.989 --> 01:14:18.973
so we're going to learn to generate from our training
distribution through a two player game setup.

01:14:18.973 --> 01:14:26.212
And the pros of GANs are that they produce really gorgeous,
state of the art samples and you can do a lot with these.

01:14:26.212 --> 01:14:33.247
The cons are that they are trickier and more unstable
to train, we're not just directly optimizing

01:14:36.499 --> 01:14:41.830
one objective function that we can
just do backprop on and train easily.

01:14:41.830 --> 01:14:47.710
Instead we have these two networks that we're trying to
balance training with so it can be a bit more unstable.

01:14:47.710 --> 01:14:57.629
And we also lose out on being able to do some of the inference
queries, P of X, P of Z given X, that we had for example in our VAE.

01:14:57.629 --> 01:15:07.040
And GANs are still an active area of research, this is a relatively new type of
model that we're starting to see a lot of and you'll be seeing a lot more of.

01:15:07.040 --> 01:15:20.633
And so people are still working now on better loss functions more stable training, so Wasserstein
GAN for those of you who are interested is basically an improvement in this direction.

01:15:22.224 --> 01:15:31.489
That now a lot of people are also using and basing models off of. There's also
other works like LSGAN, Least Squares GAN, and others.

01:15:31.489 --> 01:15:39.307
So you can look into this more. And a lot of times for these new models in
terms of actually implementing this, they're not necessarily big changes.

01:15:39.307 --> 01:15:44.279
They're different loss functions that you can change a
little bit and get like a big improvement in training.

01:15:44.279 --> 01:15:51.500
And so this is, some of these are worth looking into and
you'll also get some practice on your homework assignment.

01:15:51.500 --> 01:15:59.946
And there's also a lot of work on different types of conditional GANs
and GANs for all kinds of different problem setups and applications.

01:16:01.648 --> 01:16:05.807
Okay so a recap of today.
We talked about generative models.

01:16:05.807 --> 01:16:12.329
We talked about three of the most common kinds of generative
models that people are using and doing research on today.

01:16:12.329 --> 01:16:17.588
So we talked first about pixelRNN and
pixelCNN, which is an explicit density model.

01:16:17.588 --> 01:16:26.981
It optimizes the exact likelihood and it produces good samples but
it's pretty inefficient because of the sequential generation.

01:16:26.981 --> 01:16:35.090
We looked at VAEs, which optimize a variational lower bound on the
likelihood, and this also produces a useful latent representation.

01:16:35.090 --> 01:16:40.305
You can do inference queries. But the
sample quality is still not the best.

01:16:40.305 --> 01:16:47.657
So even though it has a lot of promise, it's still a very
active area of research and has a lot of open problems.

01:16:47.657 --> 01:16:57.375
And then GANs, we talked about, take a game-theoretic approach to training
and currently achieve the best state of the art samples.

01:16:57.375 --> 01:17:05.047
But it can also be tricky and unstable to train
and it loses out a bit on the inference queries.

01:17:05.047 --> 01:17:10.239
And so what you'll also see is a lot of recent
work on combinations of these kinds of models.

01:17:10.239 --> 01:17:12.733
So for example adversarial autoencoders.

01:17:12.733 --> 01:17:18.478
Something like a VAE trained with an additional adversarial
loss on top which improves the sample quality.

01:17:18.478 --> 01:17:32.444
There's also things like pixelVAE is now a combination of pixelCNN and VAE so there's a lot
of combinations basically trying to take the best of all these worlds and put them together.

01:17:32.444 --> 01:17:40.449
Okay so today we talked about generative models and next
time we'll talk about reinforcement learning. Thanks.